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Abstract 

We introduce two algorithms for accurately evaluating powers to a positive inte- 
ger in floating-point arithmetic, assuming a fused multiply- add (fma) instruction is 
available. We show that our log-time algorithm always produce faithfully-rounded 
results, discuss the possibility of getting correctly rounded results, and show that 
results correctly rounded in double precision can be obtained if extended-precision is 
available with the possibility to round into double precision (with a single rounding) . 

! 1 Introduction 

on 

We deal with the implementation of the integer power function in floating-point arith- 
■^j- ■ metic. In the following, we assume a radix-2 floating-point arithmetic that follows the 
iy-^ . IEEE- 754 standard for floating-point arithmetic. We also assume that a fused multiply- 
and-add (fma) operation is available, and that the input as well as the output values of 
q \ the power function are not subnormal numbers, and are below the overflow threshold (so 
that we can focus on the powering of the significands only). 

An important case dealt with in the paper will be the case when an internal format, 
wider than the target format, is available. For instance, to guarantee - in some cases - 
correctly rounded integer powers in double precision arithmetic, we will have to assume 
that a double-extended precision is available. The examples will consider that it has a 
64-bit precision, which is the minimum required by the IEEE- 754 standard. 

The IEEE- 754 standard pQ for radix-2 floating-point arithmetic (and its follower, the 
IEEE-854 radix- independent standard [3]) require that the four arithmetic operations and 
the square root should be correctly rounded. In a floating-point system that follows the 
standard, the user can choose an active rounding mode from: 

• rounding towards — oo: RD (x) is the largest machine number less than or equal to 
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• rounding towards +00: RU (x) is the smallest machine number greater than or 
equal to x; 

• rounding towards 0: RZ (x) is equal to RD (x) if x > 0, and to RU (x) if x < 0; 

• rounding to nearest: RN (x) is the machine number that is the closest to x (with a 
special convention if x is exactly between two machine numbers: the chosen number 
is the "even" one, i.e., the one whose last significand bit is a zero). 

When aob is computed, where a and b are floating-point numbers and o is +, — , x or 
-T-, the returned result is what we would get if we computed 006 exactly, with "infinite" 
precision and rounded it according to the active rounding mode. The default rounding 
mode is round-to- nearest. This requirement is called correct rounding. Among its many 
interesting properties, one can cite the following result (the first ideas that underlie it go 
back to M0ller [ID]). 

Theorem 1 (Fast2Sum algorithm) (Theorem C ofJEj, page 236). Assume the radix r 
of the floating-point system being considered is less than or equal to 3, and that the used 
arithmetic provides correct rounding with rounding to nearest. Let a and b be floating- 
point numbers, and assume that the exponent of a is larger than or equal to that ofb. The 
following algorithm computes two floating-point numbers s and t that satisfy: 

• s + t = a + b exactly; 

• s is the floating-point number that is closest to a + b. 
Algorithm 1 (Fast2Sum(a,b)) 

s := RN(a + b); 
z := RN(s-a); 
t := RN{b-z); 

If no information on the relative orders of magnitude of a and b is available, there is 
an alternative algorithm introduced by Knuth j6]. It requires 6 operations instead of 3 
for the Fast2Sum algorithm, but on any modern computer, the 3 additional operations 
cost significantly less than a comparison followed by a branching. 

Some processors (e.g., the IBM PowerPC or the Intel/HP Itanium [2]) have a fused 
multiply-add (fma) instruction that allows to compute ax ± b, where a, x and b are 
floating-point numbers, with one final rounding only. This instruction allows one to 
design convenient software algorithms for correctly rounded division and square root. It 
also has the following interesting property. From two input floating-point numbers a and 
b, the following algorithm computes c and d such that c+d = ab, and c is the floating-point 
number that is nearest ab. 
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Algorithm 2 (Fast2Mult(a,b)) 

c := RN(ab); 
d := RN(ab-c); 

Performing a similar calculation without a fused multiply-add operation is possible [3] 
but requires 17 floating-point operations instead of 2. 

Algorithms Fast2Sum and Fast2Mult both provide double-precision results of value 
(x + y) represented in the form of pairs (x, y) . In the following we need product of 
numbers represented in this form. However, we will be satisfied with approximations to 
the product, discarding terms of the order of the product of the two low-order terms. Given 
two double-precision operands (a h + ai) and (b h + bi) the following algorithm DblMult 
computes (x, y) such that (x + y) — [{ah + ai){b h + + 5) where the relative error 5 
is discussed in Section 3 below. 



Algorithm 3 (Db\Mu\t(a h ,ai,b h ,bi)) 

t 

s 

(x', u) 



(x", v) 



V 



{x,y) 



RN (a,6 h ); 
RN (a h bi + t); 
Fast2Mult (a h , b h ) 
Fast2Sum (x', s); 
RN (u + v); 
Fast2Sum (x", y')\ 



Note that the condition for applying Fast2Sum is satisfied. 



2 The two algorithms 

We now give two algorithms for accurately computing x n , where x is a floating-point 
number, and n is an integer greater than or equal to 1. We assume that an fma instruction 
is available, as it is used in Fast2Mult and thus implicitly also in DblMult. 

The first {0(n) time) algorithm is derived from the straightforward, (n — 1)- 
multiplication, algorithm. It is simple to analyze and will be faster than the other one if 
n is small. 



Algorithm 4 (LinPower(x, n), n > 1) 

(M):=M); 

for % from 2 to n do 

(h,v) := Fast2Mult(h,x); 
I := RN(lx + v); 
end do; 
return (h, I); 

where the low order terms are accumulated with appropriate weights using a Horner 
scheme evaluation. Algorithm LinPower uses 3n — 3 floating-point operations. 
The second ((9(log(n))-time) algorithm is based on successive squarings. 
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Algorithm 5 (LogPower(a;, n), n > 1) 



i := n; 



(h,l) := (1,0); 
:= (x,0); 



while z > 1 do 



if (i mod 2) = 1 then 



(h,l) := DblMult(h,l,u,v); 



end; 



(w,i>) := DblMult(u,v,u,v); 
i:= [i/2j; 
end do; 

return DblMult(h,l,u,v); 



Due to the approximations performed in algorithm DblMult, terms corresponding to 
the product of low order terms are not included. A thorough error analysis is performed 
below. The number of floating-point operations used by the LogPower algorithm is be- 
tween 11(1+ |_l°g2( n )J) an d 11(1 + 2 |_log 2 (n)J), whereas for LinPower it is 3(n — 1). Hence, 
LogPower will become faster than LinPower for values of n around 30 (but counting the 
floating-point operations only gives a rough estimate, the actual threshold will depend on 
the architecture and compiler). 

3 Error analysis 

We will use the following result. 

Theorem 2 (Theorem 2.2 of [4|, p. 38) Assume a radix-r floating-point system F, 
with precision p. If x G R lies in the range of F, then 



Theorem 3 Let e = 2~ p , where p is the precision of the radix-2 floating-point system 
used. If \ai\ < 2~ p \ah\ and \bi\ < 2~ p \bh\ then the returned value (x,y) of function 
DblMult(dh, ai,bh,bi) satisfies 




-p+i 



3.1 Error of function DblMult 



x + y = (a h + ai)(b h + 6j)(l + 77), 



with 



r]\ <6e 2 + 16e 3 + 17e 4 + lle 5 



+ 5e 6 + e 7 . 



Notes: 



1. as soon as p > 5, we have \rj\ < 7e 2 ; 

2. in the case of single precision (p = 24), \r]\ < 6.000001e 2 ; 

3. in the case of double precision (p = 53), \r]\ < (6 + 2 x 10 -15 ) e 2 . 
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Proof: Following the notation in Algorithm 5, with e^'s being variables of absolute value 
less than e, we have 

x + y = x" + RN (u + v) 

= x" + (u + v)(l + e l ) 

= (x" + v) + U + U6i + V€i 
= x' + S + U + U€\ + V€i 

= a h b h + s + uei + we! 

= a/A + [a h k + (aj6 h )(l + e 3 )](l + e 2 ) + uei + »i 

= a/A + a/A + aib h + a h bie 2 + a x b h e 2 + aib h e 2 e 3 + a A £3 + uei + uei. 

We also have a; = £40/1, bi = e 5 bh, u = e 6 ahbh, and 

v = e 7 (x' + s) 

= £7 (oA(l + e 8 ) + [a/A + aib h (l + e 3 )](l + e 2 )) 

= e 7 (a ft 6 ft (l + e 8 ) + [e 5 a h b h + e±a h b h (l + e 3 )](l + e 2 )) 

= e 7 a fc & h (1 + e 8 + e 5 + e 2 e 5 + e 4 + e 2 e 4 + e 3 e 4 + e 2 e 3 e 4 ) 

= r)ia h bh, 

with l^l < e + 3e 2 + 3e 3 + e 4 . Hence 

£ + y = ahbh + a h bi + aA + (aj&j - e A e 5 a h b h ) + a h b h (e 2 e 5 + e 2 e 4 + e 2 e 3 e 4 + e 3 e 4 + e 4 e 6 + 77161) 
= (a h + cn)(b h + bi) + a h b h rj 2 , 

with |r/ 2 | < 6e 2 + 4e 3 + 3e 4 + e 5 . 

Now, from = (a h + a;)(l + e 9 ) and b h = {b h + b{)(l + e w ) we deduce 

x + y=(a h + ai)(b h + 6 { )(1 + 77), 

with 77 = (1 + e) 2 ?7 2 , which gives H < 6e 2 + 16e 3 + 17e 4 + lie 5 + 5e 6 + e 7 . □ 

3.2 Error of algorithm LogPower 

Theorem 4 The two values h and I returned by algorithm LogPower satisfy 

h + l = x n (l + a), 

with 

(i-H) n_l <i+a< (i + Mr -1 

where \r)\ < 6e 2 + 16e 3 + 17e 4 + lie 5 + 5e 6 + e 7 is the same value as in Theorem^ 

Proof: Algorithm LogPower computes approximations to powers of x, using x l+ i = x l x^ . 
By induction, one easily shows that the approximation to x k is of the form x k (l + /3k), 
where (1 — |?7|) fc_1 < (1+^) < (l + |77|) fc_1 . If we call r] i+ j the relative error (obtained from 
Theorem [3]) when multiplying together the approximations to x l and x- 7 , the induction 
follows from 

(l-r 7 ) i - 1 (l-7 7 y- 1 (l-77) < + A)) + /?;)) (l+i^) < (l+vY-'il+vy-'il+v)- 

□ 
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Tabled] gives bounds on \a\ for several values of n (note that the bound is an increasing 
value of n), assuming the algorithm is used in double precision. 
Define the significand of a non-zero real number u to be 

u 

2L lo §2 WW ' 

Define a max as the bound on \a\ obtained for a given value of n. From 

x n (l - a max ) < h + l < x n (l + a max ), 

we deduce that the significand of h + I is within 2a max from x n /2^ og2 \ h + l \\ . From the 
results given in Table (TJ we deduce that for all practical values of n the significand of h + l 
is within much less than 2 -53 from x n /2^- log2 \ h + l \\ (indeed, to get 2a max larger that 2 -53 , 
we need n > 2 49 ). This means that RN (h + /) is within less than one ulp from x n , hence 

Theorem 5 If algorithm LogPower is implemented in double precision, then RN (h + I) 
is a faithful rounding of x n , as long as n < 2 49 . 



n 


-log 2 (a max ) 


n 


-log 2 (a max ) 


3 


102.41 


1000 


93.45 


4 


101.83 


10,000 


90.12 


5 


101.41 


100,000 


86.80 


10 


100.24 


1,000,000 


83.48 


20 


99.16 


10,000,000 


80.16 


30 


98.55 


100,000,000 


76.83 


40 


98.12 


2 32 


71.41 


50 


97.80 






100 


96.78 






200 


95.77 







Table 1: Binary logarithm of the relative accuracy (— \og 2 (ct max )) , for various values of n 
assuming algorithm LogPower is used in double precision. 

Moreover, for n < 10 8 , RN (h + /) is within 0.50000007 ulps from the exact value: we 
are very close to correct rounding (indeed, we almost always return a correctly rounded 
result), yet we cannot guarantee correct rounding, even for the smallest values of n. This 
requires a much better accuracy, as shown in Section HI To guarantee a correctly rounded 
result in double precision, we will need to run algorithm LogPower in double-extended 
precision. Table [2] gives bounds on |a| for several values of n assuming the algorithm is 
realized in double-extended precision. As expected, we are 22 bits more accurate. 

3.3 Error of algorithm LinPower 

Define hi, Vi, k as the values of variables h, v and I at the end of the loop of index i of the 
algorithm. Define U as the value variable U would have if the instructions I := RN (Ix + v) 
were errorless (that is, if instead we had / := (Ix + v) exactly): 

k = v i + Vi^ix + Vi_ 2 x 2 + ^-3^ 3 H 1- v 2 x l ~ 2 . (1) 
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Table 2: Binary logarithm of the relative accuracy (— \og 2 (a max )) , for various values of n 
assuming algorithm LogPower is implemented in double-extended precision. 

Initially let h\ = x, v\ — By induction, one can easily show that 

x % = hi + v i + Vi_ix + Vi_ 2 x 2 + Vi_ 3 x 3 H + v 2 x l ~ 2 , (2) 

hence we have 

The algorithm only computes an approximation U to U. To evaluate the error of the 
algorithm, we must therefore estimate the distance between U and ij. We have h = h = 0, 
and I2 — I2 — V2 exactly. Define 6{ as the number of absolute value less than e = 2~ p such 
that 

k = RN (k-xx + Vi) = (k-ix + Vi)(l + €i). 
We have Z 3 = Z 3 (l + e 3 ), and by induction, we find for i > 4, using — k — k-ix: 

U = k(l + 6i) 

+ Z i _ 1 e i _irc(l + Ci) 

+ fi- 2 ei- 2 ^ 2 (l + ei-i)(l + ei) (3) 

+ / 3 e 3 ^ 3 (l + e 4 )(l + e 5 ) ■ • • (1 + ^(l + e<). 

To derive a useful bound from this result, we must make a simplifying hypothesis. We 
know that \vi\ < e|/ij|. We assume hi is close enough to x l , so that 

< 2e|a;| i 

(this means that our estimate for x n will become wrong when the algorithm becomes very 
inaccurate for x\ i < n). From ([T]), we therefore have: 

\k\ < 2(i - l)e|xp, 

from which, using Q, we deduce 



7 



where 

\v\ < 2\x\ n e 2 [{n - 1) + (n - 2)(1 + e) + (n - 3)(1 + e) 2 + ■ • • + 2(1 + e)™" 3 ] . (4) 
This gives the following result 

Theorem 6 (Accuracy of algorithm LinPower) If fori < n, \v{\ < 2 1 ~ p \x\' l ! the final 
computed values h n and l n of the variables h and I of the algorithm satisfy 

h n + l n = x n (l + a), 

where \a\ < 2e 2 [(n - 1) + (n - 2)(1 + e) + (n - 3)(1 + e) 2 + • • • + 2(1 + e)™" 3 ] . 

Let us try to compute an estimate of the coefficient 7 = (n — 1) + (n — 2)(1 + e) + 
(n - 3)(1 + e) 2 + • • • + 2(1 + e) n ~ 3 in a. 
Define a function 

(p{t) = t n ~ l + (1 + e)t n ~ 2 + (1 + e) 2 t"- 3 + ■ ■ • + (1 + e) n ~H 2 . 

One can notice that 7 = <p'(l), so that if we are able to find a simple formula for <p(t) we 
will be able to deduce a formula for 7. We have 



<p(t) = (l + e) 



n-l 



1 + e 



n-l 



1 + e 



n-2 



hence 



= (1 + c) 



n-l 



+ ■ ■ ■ + 



- 1 



1 + e 



1 + e 



Thus 



<f/{t) = (l + e) 



n-2 



("-i)(+r-"(+r'+i 1 
(A - 1) 2 

Hence a bound on the value of lal is, 



lad <2e 2 (l + e) n " 2 



(rfc-i) 2 



(n 2 - n - 2)e 2 . 



Table [3] gives the obtained bound on \a\ for several values of n, assuming double 
precision (e = 2 -53 ). That table shows that as soon as n is larger than a few units, 
algorithm LinPower is less accurate than algorithm LogPower. 



4 Correct rounding 

In this section we consider algorithm LogPower only: first because it is the fastest for 
all reasonable values of n, second because it is the only one for which we have certain 
error bounds (the error bounds of algorithm LinPower are approximate only). And if 
needed, specific algorithms could be designed for small values of n. We are interested 
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Table 3: Binary logarithm of the relative accuracy (— log 2 (a max )), for various values of n 
assuming algorithm LinPower is implemented in double precision. 

in getting correctly rounded results in double precision. To do so, we assume that we 
perform algorithm LogPower in double extended precision. The algorithm returns two 
double-extended numbers h and I such that 

x n (l - a max ) < h + l < x n (l + a ma x), 

where a max is given in Table [2 

In the following we will need to distinguish two roundings, e.g., -RiV e means round- 
to-nearest in extended double precision and RNj is round-to-nearest in double precision. 
Let ulp(-) denote "unit-in-last-position" such that \x — RN (x)\ < |ulp(x). 

V. Lefevre introduced a new method for finding hardest-to-round cases for evaluating 
a regular function [Hl[7j. That method allowed Lefevre and Muller to give such cases for 
the most familiar elementary functions [9]. Recently, Lefevre adapted his method to the 
case of functions x n and x 1 /™, when n is an integer. For instance, in double-precision 
arithmetic, the hardest to round case for function x 51 corresponds to 

x = 1.0100010111101011011011101010011111100101000111011101 

we have 

x 51 = 1.101100111010010001110010000110010000010110101110111Q 1 

53 bits 

0000000000 ■ • • 0000000000 100 • • • x 2 17 

V v ' 

59 zeros 

which means that x n is extremely close to the exact middle of two consecutive double- 
precision numbers. There is a run of 59 consecutive zeros after the rounding bit. This 
case is the worst case for all values of n between 3 and 145. Table H] gives the maximal 
length of the chains of identical bits after the rounding bit for 3 < n < 145. 
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n 


Number of identical bits 
after the rounding bit 


32 


48 


76,81,85 


49 


9, 15, 16, 31, 37, 47, 54, 55, 63, 65, 74, 80, 83, 86, 105, 109, 126, 130 


50 


10, 14, 17, 19, 20, 23, 25, 33, 34, 36, 39, 40, 43, 46, 52, 53, 
72, 73, 75, 78, 79, 82, 88, 90, 95, 99, 104, 110, 113, 115, 117, 
118, 119, 123, 125, 129, 132, 133, 136, 140 


51 


3, 5, 7, 8, 22, 26, 27, 29, 38, 42, 45, 48, 57, 60, 62, 64, 68, 69, 
71, 77, 92, 93, 94, 96, 98, 108, 111, 116, 120, 121, 124, 127, 128, 
131, 134, 139, 141 


52 


6, 12, 13, 21, 58, 59, 61, 66, 70, 102, 107, 112, 114, 137, 138, 145 


53 


4, 18, 44, 49, 50, 97, 100, 101, 103, 142 


54 


24, 28, 30, 41, 56, 67, 87, 122, 135, 143 


55 


89, 106 


56 


11, 84, 91 


57 


35,144 


58 


51 


59 



Table 4: Maximal length of the chains of identical bits after the rounding bit (assuming 
the target precision is double precision) in the worst cases for n from 3 to 145. 

Define a breakpoint as the exact middle of two consecutive double precision numbers. 
RNd (h + I) will be equal to RNd (x n ) if and only if there is no breakpoint between x n 
and h + l. 

The worst case obtained shows that if x is a double-precision number, and if 3 < 
n < 145, then the significand y of x 51 is always at a distance larger than 2~ 113 from the 
breakpoint fi (see Figured]) where the distance \y — > 2~( 53+59+1 ) = 2~ 113 . 



a 2~ 52 n = {a + 1)2" 52 (a + 1)2~ 52 

Figure 1: Position of the hardest to round case y = x 51 within rounding interval 
[a2 -52 ; (a + 1)2 -52 ] with breakpoint \i — (a + ^)2 -52 , for significand defined by integer a. 

We know that the significand of h + I is within 2a max from that of x n , where a max 
(as given by its binary logarithm) is listed in Table [21 For all values of n less than or 
equal to 145, we have 2a max < 2~ 113 , thus RN d (h + l) = RN d (x n ). We therefore get the 
following result: 

Theorem 7 If algorithm LogPower is run in double- extended precision, and if 3 < n < 
145, then RNd (h + l) = RNd (x n ): Hence by rounding h + l to the nearest double-precision 
number, we get a correctly rounded result. 

Now, two important remarks: 

• We do not have the worst cases for n > 145, but from probabilistic arguments we 
strongly believe that the lengths of the largest chains of consecutive bits after the 
rounding bit will be of the same order of magnitude (i.e., around 50) for some range 
of n above 145. However, it is unlikely that we will be able to show correct rounding 
in double precision for values of n larger than 1000. 
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• On an Intel Itanium processor, it is possible to directly add two double-extended pre- 
cision numbers and round the result to double precision without a "double rounding" 
(i.e., without having an intermediate sum rounded to double-extended precision). 
Hence Theorem [7] can directly be used. It is worth being noticed that the draft re- 
vised standard IEEE 754-R (see |http : / /754r . ucbtest . org/p includes the fma as 
well as rounding to any specific destination format, independent of operand formats. 

Conclusion 

It has been shown that the function x n can be calculated in time (9(logn) with correct 
rounding in double precision, employing double-extended precision arithmetic, at least 
for the range 3 < n < 145. A fused multiply accumulate (fma) instruction is assumed 
available for algorithm efficiency reasons; and to keep the analysis simple, it was assumed 
that the input as well as the output are not subnormal numbers and are below the overflow 
threshold. 

A simpler, 0(n) time algorithm, faster than the above for small values of n, was 
also analyzed. However, its error analysis turned out to be more complicated (and less 
rigorous), and also to be less accurate than the other. 
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